ABSTRACT

Publication bias arises whenever the probability that a study is
published depends on the statistical significance of its results. This
bias, often called the file-drawer effect since the unpublished results are
imagined to be tucked away in researchers' file cabinets, is potentially 
a severe impediment to combining the statistical results of studies 
collected from the literature. With almost any reasonable quantitative 
model for publication bias, only a small number of studies lost in the 
file-drawer will produce a significant bias. This result contradicts the 
well known Fail Safe File Drawer (FSFD) method for setting limits on 
the potential harm of publication bias, widely used in social, medical 
and psychic research. This method incorrectly treats the file drawer as 
unbiased, and almost always misestimates the seriousness of publication 
bias. A large body of not only psychic research, but medical and social 
science studies, has mistakenly relied on this method to validate claimed 
discoveries. Statistical combination can be trusted only if it is known 
with certainty that all studies that have been carried out are included. 
Such certainty is virtually impossible to achieve in literature surveys. 
1. Introduction: Combined Studies

The goal of many studies in science, medicine, and engineering is the 
measurement of a quantity in order to detect a suspected effect or gain information 
about a known one. Observational errors and other noise sources make this a 
statistical endeavor, in which one obtains repeated measurements in order to 
average out these fluctuations. 

If individual studies are not conclusive, improvement is possible by combining 
the results of different measurements of the same effect. The idea is to perform 
statistical analysis on relevant data collected from the literature^ in order to 
improve the signal-to-noise (on the assumption that the noise averages to zero). A 
good overview of this topic can be found in (Hedges and Olkin, 1985). For clarity
and consistency with most of the literature, throughout this paper individually 
published results will be called studies, and the term analysis will refer to the 
endeavor to combine two or more studies. 

Two problems arise in such analyses. First, experimenters often publish only
statistical summaries of their studies, not the actual data. The analyst is then
faced with combining the summaries, a non-trivial technical problem (see e.g.
Rosenthal, 1978; Rosenthal, 1995). Modern communication technology should
circumvent these problems by making even large data arrays accessible to other
researchers. Reproducible research (Claerbout, 1999; Buckheit and Donoho, 1995)
is a discipline for doing this and more, but even this methodology does not solve
the other problem, publication bias.

^In some fields this is called meta-analysis, from the Greek meta, meaning behind,
after, higher or beyond, and often denoting change. Its usage here presumably refers
to new issues arising in the statistical analysis of combined data, such as the file
drawer effect itself. It is used mostly in scientific terminology, to imply a kind of
superior or oversight status, as in metaphysics. I prefer the more straightforward
term combined analysis.

2. Publication Bias 

The second problem facing combined analysis is that studies collected from the 
literature are often not a representative sample. Erroneous statistical conclusions 
may result from a prejudiced collection process or if the literature is itself a biased 
sample of all relevant studies. The latter, publication bias, is the subject of this 
paper. The bibliography contains a sample of the rather large literature on the 
file drawer effect and publication bias. The essence of this effect, as described by 
nearly all authors, can be expressed in statistical language as follows: 



Definition: A publication bias exists if the probability that a study reaches 
the literature, and is thus available for combined analysis, depends on the 
results of the study. 



What matters is whether the experimental results are actually used in the combined 
analysis, not just the question of publication. That is, the relevant process is this 
entire sequence, following the initiation of the study: 
1. the study is carried out to a pre-defined stop point

2. all data are permanently recorded 

3. data are reduced and analyzed 

4. paper is written 

5. paper is submitted to a journal 

6. journal agrees to consider the paper 

7. referee and author negotiate revisions 

8. referee accepts paper 

9. editor accepts paper for publication 

10. author still wishes to publish paper 

11. author's institution agrees to pay page charges 

12. paper is published 

13. paper is located during literature search 

14. data from paper included in combined analysis 

I refer to this concatenated process loosely as publication, but it is obviously more 
complex than what is usually meant by the term. Some of the steps may seem 
trivial, but all are relevant. Each step involves human decisions and so may be 
influenced by the result of the study. Publication probability may depend on 
the specific conclusion,^ on the size of the effect measured, and on the statistical
confidence in the result. 

^The literature of social sciences contains horror stories of journal editors and 
others who consider a study worthwhile only if it reaches a statistically significant, 
positive conclusion; that is, an equally significant rejection of a hypothesis is not 
considered worthwhile! 
Here is the prototype for the analyses treated here: each of the publishable
studies consists of repeated measurements of a quantity, say x. The number of
measurements is in general different in each such study. The null hypothesis is that
x is normally distributed, say

$$ x \sim N(\mu, \sigma) ; \qquad (1) $$

this notation means that the errors in x are normally distributed with mean $\mu$
and standard deviation $\sigma$. The results of the study are reported in terms of a
shifted (i.e. $\mu$ is subtracted) and renormalized (i.e. $\sigma$ is divided out) standard
normal deviate $Z = (x - \mu)/\sigma$. The null hypothesis, usually that $\mu$ is zero,
one-half, or some other specific value, yields

$$ Z \sim N(0, 1) . \qquad (2) $$

This normalization removes the dependence on the number of repeated
measurements in the studies. In this step it is of course important that $\sigma$ be well
estimated, often a tricky business.

A common procedure for evaluating such studies is to obtain a "p-value" from
the probability distribution P(Z), and interpret it as providing the statistical
significance of the result. The discussion here is confined to this approach because
it has been adopted by most researchers in the relevant fields. However, as noted
by Sturrock (Sturrock, 1994; Sturrock, 1997) and others (Matthews, 1999; Jefferys,
1990; Jefferys, 1995; Berger and Delampady, 1987; Berger and Sellke, 1987), this
procedure may yield incorrect conclusions, usually overestimating the significance
of putative anomalous results. The Bayesian methodology is probably the best way
to treat publication bias (Givens, Smith, and Tweedie, 1997; Biggerstaff, 1995;
Biggerstaff, Tweedie, and Mengersen, 1994; Tweedie, Scott, Biggerstaff, and
Mengersen, 1996).
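The normalization and p-value steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any particular study: it assumes the standard deviation of a single measurement is known, and the function names `standard_deviate` and `one_sided_p`, as well as the sample data, are hypothetical.

```python
import math

def standard_deviate(measurements, mu_null, sigma_single):
    """Z for the mean of repeated measurements under the null value mu_null.

    Assumes sigma_single, the standard deviation of one measurement, is known.
    """
    n = len(measurements)
    mean = sum(measurements) / n
    sigma_mean = sigma_single / math.sqrt(n)  # standard error of the mean
    return (mean - mu_null) / sigma_mean

def one_sided_p(z):
    """Probability that a standard normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical study: 16 measurements, null value 0, unit single-measurement scatter.
data = [0.5] * 16
z = standard_deviate(data, mu_null=0.0, sigma_single=1.0)
print(z, one_sided_p(z))
```

Dividing by the standard error of the mean is what makes Z comparable across studies with different numbers of measurements.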



This section concludes with some historical notes. Of course science has long
known the problems of biased samples. The interesting historical note (Petticrew,
1998) nominates an utterance by Diagoras of Melos in 500 BC as the first historical
mention of publication bias^. See also (Dickersin and Min, 1993) for other early
examples of awareness of publication bias.

^Early Greek sailors who escaped from shipwrecks or were saved from drowning
at sea displayed portraits of themselves in a votive temple on the Aegean island of
Samothrace, in thanks to Neptune. Answering a claim that these portraits are sure
proof that the gods really do intervene in human affairs, Diagoras replied "Yea, but
... where are they painted that are drowned?"

Publication bias is an important problem in medical studies (e.g., Sacks, H. S.,
Reitman, D., Chalmers, T. C., & Smith, H., 1983; Persaud, 1996; Allison, Faith,
and Gorman, 1996; Laupacis, 1997; Dickersin, 1997; Kleijnen & Knipschild, 1992;
Earleywine, 1993; Helfenstein & Steiner, 1994; Faber & Galloe, 1994), as
well as other fields (see Fiske, Rintamaki, & Karvonen, 1998; Bauchau, 1997 for
example). The negative conclusions we shall soon reach about the commonly used
procedure to deal with this problem yield a discouraging picture of the usefulness
of combined analysis in all of these contexts. On the other hand, the application
of modern, sophisticated methods such as those listed above (see also Taylor and
Tweedie, 1998; Taylor and Tweedie, 1999) is encouraging.

3. The "Fail Safe File Drawer" Calculation

Rosenthal's influential work (Rosenthal, 1979; Rosenthal, 1984; Rosenthal,
1995) is widely used to set limits on the possibility that the file drawer effect is
causing a spurious result. One of the clearest descriptions of the overall problem,
and certainly the most influential in the social sciences, is (Rosenthal, 1984):

... researchers and statisticians have long suspected that the studies
published in the behavioral sciences are a biased sample of the studies
that are actually carried out. ... The extreme view of this problem, the
"file drawer problem," is that the journals are filled with the 5% of the
studies that show Type I errors, while the file drawers back at the lab
are filled with the 95% of the studies that show nonsignificant (e.g.,
p > .05) results.

A Type I error is rejection of a true null hypothesis. (Type II is failing to reject a
false one.)

This lucid description of the problem is followed by a proposed solution: 

In the past, there was very little we could do to assess the net effect 
of studies tucked away in file drawers that did not make the magic .05 
level . . . Now, however, although no definitive solution to the problem 
is available, we can establish reasonable boundaries on the problem and 
estimate the degree of damage to any research conclusions that could be 
done by the file drawer problem. The fundamental idea in coping with 
the file drawer problem is simply to calculate the number of studies 
averaging null results that must be in the file drawers before the 
overall probability of a Type I error can be just brought to any desired 
level of significance, say p = .05. This number of filed studies, or the 
tolerance for future null results, is then evaluated for whether such a 
tolerance level is small enough to threaten the overall conclusion drawn 
by the reviewer. If the overall level of significance of the research review 
will be brought down to the level of just significant by the addition of 
just a few more null results, the finding is not resistant to the file 
drawer threat. 

The italic emphasis is original; the boldface is mine, and marks what I believe is the
fundamental flaw in reasoning: the assumption that the filed studies are "averaging
null results."
By its very definition, the file drawer is a biased sample. In the nominal
example given, it is the 95% of the studies that have 5% or greater chance of
being statistical fluctuations. The mean Z-value in this subsample is not zero, but
instead

$$ \bar{Z}_{filed} = -\sqrt{\frac{2}{\pi}} \, \frac{e^{-Z_0^2/2}}{\mathrm{erfc}(-Z_0/\sqrt{2})} = -0.1085 , \qquad (3) $$

where $Z_0$ is the value corresponding to the adopted significance threshold. As we
will see below, Rosenthal's analysis explicitly assumes that $\bar{Z}_{filed} = 0$ for the file
drawer sample. Because this assumption contradicts the essence of the file drawer
effect, the quantitative results are incorrect.
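The quoted mean of the filed subsample can be checked numerically. The sketch below assumes the nominal threshold of 1.645 (the one-sided 5% level) and uses the standard result that a standard normal deviate restricted to Z < z0 has mean equal to minus the density at z0 divided by the fraction below z0; the function name `mean_z_below` is hypothetical.

```python
import math

def mean_z_below(z0):
    """Mean of a standard normal deviate restricted to Z < z0."""
    density = math.exp(-0.5 * z0 * z0) / math.sqrt(2.0 * math.pi)
    fraction_below = 0.5 * math.erfc(-z0 / math.sqrt(2.0))
    return -density / fraction_below

z_filed = mean_z_below(1.645)
print(round(z_filed, 4))  # -0.1085
```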



We now recapitulate the analysis given in (Rosenthal, 1984), using slightly
different notation. For convenience and consistency with the literature, we refer to
this as the fail safe file drawer, or FSFD, analysis. The basic context is a specific
collection of published studies having a combined Z that is deemed to be significant;
that is, the probability that the Z value is due to a statistical fluctuation is below
some threshold, say 0.05. The question Rosenthal seeks to answer is: How many
members of a hypothetical set of unpublished studies have to be added to this
collection in order to bring the mean Z down to a level considered insignificant?
As argued elsewhere, this does not mirror publication bias, but this is the problem
addressed.

Let $N_{pub}$ be the number of studies combined in the analysis of the published
literature (Rosenthal's K), and $N_{filed}$ be the number of studies that are unpublished,
for whatever reason (Rosenthal's X). Then $N = N_{pub} + N_{filed}$ is the total number
of studies carried out.

The basic relation [Equation (5.16) of (Rosenthal, 1984)] is

$$ 1.645 = \frac{N_{pub} \bar{Z}_{pub}}{\sqrt{N}} . \qquad (4) $$

This is presumably derived from

$$ 1.645 = \frac{N_{pub} \bar{Z}_{pub} + N_{filed} \bar{Z}_{filed}}{\sqrt{N}} , \qquad (5) $$

i.e. an application of his method of adding weighted Z's as in Equation [5.5] of
(Rosenthal, 1984), by setting the standard normal deviate of the file drawer,
$\bar{Z}_{filed} = 0$. This is incorrect for a biased file drawer. Equation (4) can be rearranged
to give

$$ \frac{N_{filed}}{N_{pub}} = -1 + \frac{N_{pub} \bar{Z}_{pub}^2}{2.706} , \qquad (6) $$

which is the equation used widely in the literature, and throughout this paper, to
compute FSFD estimates.
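As a rough sketch of how the assumed mean of the file drawer changes the answer, the code below evaluates the rearranged FSFD relation, N_filed/N_pub = N_pub Zpub^2 / 2.706 - 1, and also solves the unrearranged relation, 1.645 sqrt(N_pub + N_filed) = N_pub Zpub + N_filed Zfiled, for N_filed by bisection. The function names and the input numbers are hypothetical, and the bisection is an illustrative choice rather than anything taken from Rosenthal's text.

```python
import math

Z_CRIT = 1.645  # one-sided 5% threshold; Z_CRIT ** 2 is about 2.706

def fsfd_ratio(n_pub, z_pub):
    """N_filed / N_pub when the file drawer is assumed to have mean Z = 0."""
    return n_pub * z_pub ** 2 / Z_CRIT ** 2 - 1.0

def filed_count(n_pub, z_pub, z_filed):
    """N_filed that brings the combined Z down to Z_CRIT, for mean filed Z <= 0."""
    def excess(n_filed):
        total = n_pub * z_pub + n_filed * z_filed
        return total - Z_CRIT * math.sqrt(n_pub + n_filed)
    if excess(0.0) <= 0.0:
        return 0.0  # the published set is not significant to begin with
    lo, hi = 0.0, 1.0
    while excess(hi) > 0.0:  # excess decreases with n_filed, so this terminates
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical published set: 20 studies with mean Z of 1.0.
print(fsfd_ratio(20, 1.0))
print(filed_count(20, 1.0, 0.0) / 20)   # same relation, solved numerically
print(filed_count(20, 1.0, -0.1) / 20)  # a slightly negative file drawer
```

With a file drawer whose mean Z is negative, fewer filed studies are needed to erase the apparent significance than the zero-mean assumption suggests.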

What fundamentally went wrong in this analysis and why has it survived 
uncriticized for so long? First, it is simply the case that the notion of null, or
insignificant, results is easily confused with $\bar{Z}_{filed} = 0$. While the latter implies the
former, the former (which is true, in a sense, for the file drawer) of course, does not 
imply the latter. 

Second, the logic behind the FSFD is quite seductive. Pouring a flood of 
insignificant studies - with normally distributed Z's - into the published sample 
until the putative effect submerges into insignificance is a neat idea. What does 
it mean though? It is indeed a possible measure of the statistical fragility of the 
result obtained from the published sample. On the other hand, there are much 
better and more direct ways of assessing statistical significance, so that I do not
believe that FSFD should be used even in this fashion. 

Third, I believe some workers have mechanically calculated FSFD results, 
found that it justified their analysis or otherwise confirmed their beliefs, and were 
therefore not inclined to look for errors or examine the method critically. A simple 
thought experiment makes the threat of a publication bias with $N_{filed}$ on the same
order as $N_{pub}$ clear: construct a putative file drawer sample by multiplying all
published Z values by -1; the total sample then has Z exactly zero, no matter
what.
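The thought experiment can be made concrete in a few lines. The sketch assumes equally weighted studies, so that the combined deviate is the sum of the Z values divided by the square root of their number; the published Z values are made up for illustration.

```python
import math

def combined_z(z_values):
    """Combined deviate for equally weighted studies: sum of Z over sqrt(count)."""
    return sum(z_values) / math.sqrt(len(z_values))

published = [1.8, 2.3, 1.7, 2.9, 2.1]    # hypothetical published Z values
mirror_drawer = [-z for z in published]  # the putative file drawer

print(combined_z(published))                  # well above 1.645
print(combined_z(published + mirror_drawer))  # zero up to rounding error
```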

The two-sided case is often raised in this context. That is to say, publication
bias might mean that studies with either $Z \gg 0$ or $Z \ll 0$ are published in
preference to those with small $|Z|$. This situation could be discussed, but it is a
red herring in the current context (see e.g. Iyengar & Greenhouse, 1988). Further,
none of the empirical distributions I have seen published show any hint of the
bimodality that would result from this effect, including those in (Radin, 1997).



In addition, I was at first puzzled by statements such as that FSFD analysis
assesses "tolerance for future null results." The file drawer effect is something 
that has already happened by the time one is preparing a combined analysis. I 
concluded that such expressions are a kind of metaphor for what would happen if 
the studies in the file drawers were suddenly made available for analysis. But even 
if this imagined event were to happen, one would still be left with explaining how 
a biased sample was culled from the total sample. It seems to me that whether 
the result of opening this Pandora's file drawer is to dilute the putative effect 
into insignificance, or not, is of no explanatory value in this context - even if the 
calculation were correct. 

In any case, the bottom line is that the widespread use of the FSFD to 
conclude that various statistical analyses are robust against the file drawer effect 
is wrong, because the underlying calculation is based on an inapplicable premise. 



Other critical comments have been published, including (Berlin, 1998; Sharpe,
1997). Most important is the paper (Iyengar & Greenhouse, 1988), which makes
the same point, but further provides an analysis that explicitly accounts for the
bias (for the case of the truncated selection functions discussed in the next section).
These authors note that their formulation "always yields a smaller estimate of the
fail-safe sample size than does" Rosenthal's.
4. Statistical Models

This section presents a specific model for combined analysis and the file drawer
effect operating on it. Figure 1 shows the distribution of N = 1000 samples from a
normal distribution with zero mean and unit variance, namely

$$ G(Z) = \frac{N}{\sqrt{2\pi}} \, e^{-Z^2/2} . \qquad (7) $$

Fig. 1. Histogram corresponding to the null hypothesis: normal distribution of
Z values from 1000 independent studies. Those 5% with Z > 1.645 are published
(open bars), the remainder "tucked away in file drawers" - i.e., unpublished (solid
bars). Empirical and exact values of $\bar{Z}$ [cf. Equation (14)] are indicated for the filed
and published sets.

Publication bias means that the probability of completion of the entire process
detailed in Section 2 is a function of the study's reported Z. Note that we are not
assuming this is the only thing that it depends on. We use the notation

$$ S(Z) = \text{publication probability} \qquad (8) $$

for the selection function, where $0 \le S(Z) \le 1$. Note S(Z) is not a probability
distribution over Z, and for example its Z-integral is not constrained to be 1.

This model captures what almost all authors describe as publication bias (see
e.g. Hedges, 1992; Iyengar & Greenhouse, 1988). In view of the complexity of the
full publication process (see Section 2), it is unlikely that its psychological and
sociological factors can be understood and accurately modeled. I therefore regard
the function S as unknowable. The approach taken here is to study the generic
behavior of plausible quantitative models, with no claim to having accurate or
detailed representations of actual selection functions.

Taken with the distribution G(Z) in Equation (7), an assumed S(Z)
immediately yields three useful quantities: the number of studies published

$$ N_{pub} = \int_{-\infty}^{\infty} G(Z) S(Z) \, dZ , \qquad (9) $$

the number consigned to the file drawer

$$ N_{filed} = N - N_{pub} , \qquad (10) $$

and the expected value of the standard normal deviate Z, averaged over the
published values

$$ \bar{Z}_{pub} = \frac{\int_{-\infty}^{\infty} Z \, G(Z) S(Z) \, dZ}{\int_{-\infty}^{\infty} G(Z) S(Z) \, dZ} = \frac{\int_{-\infty}^{\infty} Z \, G(Z) S(Z) \, dZ}{N_{pub}} . \qquad (11) $$

The denominator in the last equation takes into account the reduced size of the
collection of published studies ($N_{pub}$) relative to the set of all studies performed
(N). Absent any publication bias [i.e. S(Z) = 1], $\bar{Z}_{pub} = 0$ from the symmetry of
the Gaussian, as expected.
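For a selection function with no closed-form integrals, the three quantities above can be evaluated numerically. The following is a minimal sketch using a midpoint rule over a finite Z range; the function names, the integration limits, and the ramp-shaped selection function are all hypothetical choices made for illustration.

```python
import math

def gaussian_count_density(z, n_total):
    """G(Z): n_total times the standard normal density."""
    return n_total * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def publication_summary(selection, n_total, z_min=-10.0, z_max=10.0, steps=20000):
    """Midpoint-rule estimates of N_pub, N_filed and the mean published Z."""
    dz = (z_max - z_min) / steps
    n_pub = 0.0
    z_weighted = 0.0
    for i in range(steps):
        z = z_min + (i + 0.5) * dz
        weight = gaussian_count_density(z, n_total) * selection(z) * dz
        n_pub += weight
        z_weighted += z * weight
    return n_pub, n_total - n_pub, z_weighted / n_pub

# No selection at all: everything is published and the mean Z is zero.
print(publication_summary(lambda z: 1.0, 1000.0))

# A hypothetical ramp: publication probability rises linearly with Z.
ramp = lambda z: min(1.0, max(0.0, 0.5 + 0.25 * z))
print(publication_summary(ramp, 1000.0))
```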
4.1. Cut Off Selection Functions

Consider first the following choice for the selection function:

$$ S(Z) = \begin{cases} 0 & \text{for } Z < Z_0 \\ 1 & \text{for } Z \ge Z_0 \end{cases} , \qquad (12) $$

where in principle $Z_0$ can have any value, but small positive numbers are of most
practical interest. That is to say, studies which find an effect at or above the
significance level corresponding to $Z_0$ are always published, while those that don't
never are.

Putting this into the above general expressions, Equations (9) and (11), gives

$$ N_{pub} = \frac{N}{\sqrt{2\pi}} \int_{Z_0}^{\infty} e^{-Z^2/2} \, dZ = \frac{N}{2} \, \mathrm{erfc}\left( \frac{Z_0}{\sqrt{2}} \right) , \qquad (13) $$

where erfc is the complementary error function. The mean Z for these published
studies is

$$ \bar{Z}_{pub} = \frac{N}{N_{pub}} \frac{1}{\sqrt{2\pi}} \int_{Z_0}^{\infty} Z e^{-Z^2/2} \, dZ = \sqrt{\frac{2}{\pi}} \, \frac{e^{-Z_0^2/2}}{\mathrm{erfc}(Z_0/\sqrt{2})} . \qquad (14) $$

Consider the special value $Z_0 = 1.645$, corresponding to the classical 95%
confidence level and obtained by solving for Z the equation

$$ p = \frac{1}{2} \, \mathrm{erfc}\left( \frac{Z}{\sqrt{2}} \right) \qquad (15) $$

(with p = .05). This case is frequently taken as a prototype for the file drawer
(Rosenthal, 1979; Iyengar & Greenhouse, 1988). Equation (13) gives $N_{pub} = 0.05N$;
that is (by design) 5% of the studies are published and 95% are not. Equation (14)
gives $\bar{Z}_{pub} = 2.0622$.

Let the total number of studies be N = 1000: i.e., one thousand studies, of
which 50 are published and 950 are not. We have chosen this large value for N,
here and in Figure 1, for illustration, not realism. For the 50 published studies,
the combined Z, namely $\sqrt{50} \, \bar{Z}_{pub}$, in Equation (15) gives an infinitesimal p-value,
$\approx 10^{-48}$, highly supportive of rejecting the null hypothesis. The FSFD estimate
of the ratio of filed to published experiments [see Equation (6)] is about 78 for
this case, an overestimate, by a factor of around 4, of the true value of 19. The
formula of Iyengar and Greenhouse discussed above gives 11.4 for the same ratio,
an underestimate by a factor of about 1.7.
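The numbers in this example can be reproduced with a short script. The sketch below evaluates the closed forms for the cut-off model (published fraction erfc(Z0/sqrt 2)/2 and the corresponding mean published Z) and then cross-checks them by simulation. The simulation uses 200,000 null studies and a fixed seed, both arbitrary choices made only to keep sampling noise small; they are not part of the original example.

```python
import math
import random

Z0 = 1.645

def published_fraction(z0):
    """Fraction of null studies with Z >= z0."""
    return 0.5 * math.erfc(z0 / math.sqrt(2.0))

def published_mean_z(z0):
    """Mean Z of the null studies with Z >= z0."""
    return (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * z0 * z0)
            / math.erfc(z0 / math.sqrt(2.0)))

n_total = 1000
n_pub = n_total * published_fraction(Z0)         # close to 50
z_pub = published_mean_z(Z0)                     # close to 2.06
true_ratio = (n_total - n_pub) / n_pub           # close to 19
fsfd_ratio = n_pub * z_pub ** 2 / Z0 ** 2 - 1.0  # FSFD estimate, close to 78

# Monte Carlo cross-check with a larger, hypothetical number of null studies.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
published = [z for z in draws if z >= Z0]
sim_fraction = len(published) / len(draws)
sim_mean_z = sum(published) / len(published)

print(round(true_ratio, 1), round(fsfd_ratio, 1))
print(round(sim_fraction, 3), round(sim_mean_z, 2))
```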
Fig. 2. Plot of the fraction of the total number of studies that are published (thick
solid line) and filed (thick dashed line) in the model given by Equation (12). The
FSFD predictions for the filed fraction are shown as a series of thin dashed lines,
labeled by the value of N, the total number of studies carried out.

Finally, Figure 2 shows the behavior of the filed and published fractions, as a
function of $Z_0$, including the fraction of the studies in the file drawer predicted by
FSFD analysis, given by Equation (6) above. Two things are evident from this
comparison:

• The FSFD prediction is a strong function of N, whereas the true filed
fraction is independent of N.

• The FSFD values are quite far from the truth, except accidentally at a few
special values of N and $Z_0$.
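The dependence on N can be seen by tabulating both fractions. This sketch assumes that the FSFD filed fraction is obtained by converting the FSFD ratio R = N_pub Zpub^2 / 2.706 - 1 into R/(1 + R), with N_pub taken as its expected value under the cut-off model; that conversion, the clamping of negative ratios to zero, and the function names are assumptions made here, not details stated in the text.

```python
import math

Z_CRIT = 1.645

def published_fraction(z0):
    """Fraction of null studies with Z >= z0 (cut-off selection)."""
    return 0.5 * math.erfc(z0 / math.sqrt(2.0))

def published_mean_z(z0):
    """Mean Z of the published studies under cut-off selection."""
    return (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * z0 * z0)
            / math.erfc(z0 / math.sqrt(2.0)))

def true_filed_fraction(z0):
    """Filed fraction in the model; it does not depend on N."""
    return 1.0 - published_fraction(z0)

def fsfd_filed_fraction(z0, n_total):
    """Filed fraction implied by the FSFD ratio for a given total N."""
    n_pub = n_total * published_fraction(z0)
    ratio = n_pub * published_mean_z(z0) ** 2 / Z_CRIT ** 2 - 1.0
    ratio = max(ratio, 0.0)  # a negative ratio means no filed studies are needed
    return ratio / (1.0 + ratio)

for n_total in (10, 100, 1000):
    print(n_total,
          round(true_filed_fraction(1.0), 3),
          round(fsfd_filed_fraction(1.0, n_total), 3))
```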
4.2. Step Selection Functions

Since it is zero over a significant range, the selection function considered in
the previous section may be too extreme. The following generalization allows any
value between zero and one for small Z:

$$ S(Z) = \begin{cases} S_0 & \text{for } Z < Z_0 \\ 1 & \text{for } Z \ge Z_0 \end{cases} . \qquad (16) $$

Much as before, direct integration yields

$$ N_{pub} = N \left[ S_0 + (1 - S_0) \, \frac{1}{2} \, \mathrm{erfc}\left( \frac{Z_0}{\sqrt{2}} \right) \right] , \qquad (17) $$

and

$$ \bar{Z}_{pub} = (1 - S_0) \sqrt{\frac{2}{\pi}} \, \frac{e^{-Z_0^2/2}}{2 S_0 + (1 - S_0) \, \mathrm{erfc}(Z_0/\sqrt{2})} . \qquad (18) $$

Equation (17) directly determines the ratio of the number of filed studies to the
number of those published, under the given selection function. The value of $\bar{Z}_{pub}$ is
the quantity that would be used to (incorrectly) reject the hypothesis "No effect is
present; the sample is drawn from a normal distribution."
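A short numerical sketch of the step model follows. It integrates the Gaussian against a selection function equal to S0 below Z0 and 1 above it, which gives a published fraction S0 + (1 - S0) Q(Z0) and a mean published Z of (1 - S0) times the normal density at Z0 divided by that fraction, where Q is the upper tail probability. The function names and the example values (S0 = 0.5, N = 1000) are hypothetical.

```python
import math

def tail(z0):
    """Fraction of a standard normal above z0."""
    return 0.5 * math.erfc(z0 / math.sqrt(2.0))

def step_published_fraction(s0, z0):
    """N_pub / N for the step selection function."""
    return s0 + (1.0 - s0) * tail(z0)

def step_published_mean_z(s0, z0):
    """Mean Z of the published studies for the step selection function."""
    density = math.exp(-0.5 * z0 * z0) / math.sqrt(2.0 * math.pi)
    return (1.0 - s0) * density / step_published_fraction(s0, z0)

def filed_to_published(s0, z0):
    """R = N_filed / N_pub."""
    fraction = step_published_fraction(s0, z0)
    return (1.0 - fraction) / fraction

# Hypothetical case: half of the sub-threshold studies are still published.
s0, z0, n_total = 0.5, 1.645, 1000
n_pub = n_total * step_published_fraction(s0, z0)
combined = math.sqrt(n_pub) * step_published_mean_z(s0, z0)
print(filed_to_published(s0, z0), combined)
```

In this illustrative case the file drawer is smaller than the published set, yet the combined Z of the published null studies still exceeds 1.645.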
Fig. 3. This figure shows the dependence of $R = N_{filed}/N_{pub}$ on the two
parameters of the step selection function. Contours of $\log_{10} R$ are indicated. R
is large only for a tiny region at the bottom right of this diagram.

Almost all of the relevant part of the $S_0$-$Z_0$ plane corresponds to a rather small
number of unpublished studies. In particular, $R \gg 1$ in only a small region, namely
where simultaneously $S_0 \sim 0$ and $Z_0 \gg 0$. The behavior of $\bar{Z}_{pub}$ is quite simple:
roughly speaking

$$ \bar{Z}_{pub} \approx (1 - S_0) \, g(Z_0) , \qquad (19) $$

where $g(Z_0)$ is a function on the order of unity or larger for all $Z_0 > 0$. Hence the
bias brought about by even a very small file drawer is large unless $S_0 \approx 1$ (to be
expected, because for such values of $S_0$ the selection function is almost flat).
4.3. Smooth Selection Functions

It might be objected that the selection functions considered here are unrealistic in
that they have a discrete step at $Z_0$. What matters here, however, are integrals
over Z, which do not have any pathological behavior in the presence of such steps.
Nevertheless I experimented with some smooth selection functions and found
results that are completely consistent with the conclusions reached in the previous
sections for step-function choices.
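The text does not say which smooth functions were tried. As one hypothetical example, the sketch below uses a logistic selection function centered on Z0 = 1.645 with an adjustable width, and evaluates the published fraction and mean published Z by a midpoint rule. All names and parameter values here are assumptions for illustration.

```python
import math

def logistic_selection(z, z0, width):
    """Hypothetical smooth selection function rising from 0 to 1 around z0."""
    x = (z - z0) / width
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0 here, so this cannot overflow
    return e / (1.0 + e)

def published_stats(selection, z_min=-10.0, z_max=10.0, steps=20000):
    """Midpoint-rule estimate of N_pub / N and the mean published Z."""
    dz = (z_max - z_min) / steps
    weight_sum = 0.0
    z_sum = 0.0
    for i in range(steps):
        z = z_min + (i + 0.5) * dz
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        weight = density * selection(z) * dz
        weight_sum += weight
        z_sum += z * weight
    return weight_sum, z_sum / weight_sum

for width in (0.01, 0.2, 0.5):
    frac, z_pub = published_stats(lambda z: logistic_selection(z, 1.645, width))
    print(width, round(frac, 3), round(z_pub, 3), round((1.0 - frac) / frac, 1))
```

As the width shrinks, the results approach those of the sharp cut-off; for wider transitions the mean published Z remains well above zero.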
5. Conclusions

Based on the models considered here, we conclude that: 

• apparently significant, but actually spurious, results can arise from publication 
bias, with only a modest number of unpublished studies 

• the widely used Fail Safe File Drawer (FSFD) analysis is irrelevant, because 
it treats the inherently biased file drawer as unbiased, and gives grossly wrong 
estimates of the size of the file drawer 

• statistical combination of studies from the literature can be trusted to be 
unbiased only if there is reason to believe that there are essentially no 
unpublished studies (almost never the case!) 

It is hoped that these results will discourage combined ("meta") analyses based 
on selection from published literature, but encourage methodology to control 
publication bias, such as the establishment of registries (to try to render the 
probability of publication unity once a study is proposed and accepted in the 
registry).

The best prospects for establishing conditions under which combined analysis
might be reasonable even in the face of possible publication bias seem to lie in a
fully Bayesian treatment of this problem (Sturrock, 1994; Sturrock, 1997; Givens,
Smith, and Tweedie, 1997). It is possible that the approach discussed in (Radin and
Nelson, 1989) can lead to improved treatment of publication bias, but one must be
cautious when experimenting with ad hoc distributions and be aware that small
errors in fitting the tail of a distribution can be multiplied by extrapolation to the
body of the distribution.

Finally, I agree with the Referee, Prof. M. J. Bayarri, who wants to go even
further, stating: "... the Publication Bias effect should be taken into account even
when making conclusions based on a single published experiment." Certainly the
bias mechanism described in Section 2 can determine which single paper on a given
topic gets published. To say the same thing, $N_{pub} = 1$ is a perfectly legitimate case
in my analysis. Indeed, one can think of several psychological/sociological factors
that could militate against publication of a second paper on a given experiment.



I am especially grateful to Peter Sturrock for guidance and comments in the 
course of this work. This work was presented at the Peter A. Sturrock Symposium, 
held on March 20, 1999, at Stanford University. I thank Kevin Zahnle, Aaron 
Barnes, Ingram Olkin, Ed May, and Bill Jefferys for valuable suggestions. I am 
grateful to the Referee for making helpful comments and pointing out several 
additional references. I thank Jordan Gruber for calling my attention to the book 
The Conscious Universe (Radin, 1997), and its author, Dean Radin, for helpful
discussions. None of these statements are meant to imply that any of the persons 
mentioned agrees with the conclusions expressed here. 